CLAIMS

1. A method for automatically filtering a corpus of documents containing textual
and non-textual information of a natural language, the method being characterized in that
it comprises the steps of:

- dividing the corpus of documents into appropriate portions; 

- determining for each portion of the corpus of documents a regularity value (V_R)
measuring the conformity of the portion with respect to character sequences probabilities
predetermined for said language;

- comparing each regularity value with a threshold value (V_T) to decide whether the
conformity is sufficient; and

- rejecting any portion of the corpus of documents whose conformity is not
sufficient.
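
Illustrative note (not part of the claims): the four steps recited in Claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation. The names `filter_corpus` and `toy_regularity` are hypothetical, the scorer is a crude stand-in for a score based on character sequences probabilities, and the sketch assumes that a lower regularity value means better conformity.

```python
from typing import Callable, Iterable, List, Tuple


def filter_corpus(
    portions: Iterable[str],
    regularity: Callable[[str], float],
    threshold: float,
) -> Tuple[List[str], List[str]]:
    """Split portions into (kept, rejected) by comparing V_R with V_T."""
    kept: List[str] = []
    rejected: List[str] = []
    for portion in portions:
        v_r = regularity(portion)  # regularity value V_R of this portion
        # Assumption: lower V_R means better conformity (perplexity-like score).
        if v_r <= threshold:
            kept.append(portion)
        else:
            rejected.append(portion)
    return kept, rejected


def toy_regularity(text: str) -> float:
    """Hypothetical stand-in: fraction of characters that are neither letters nor spaces."""
    if not text:
        return 1.0
    odd = sum(1 for ch in text if not (ch.isalpha() or ch.isspace()))
    return odd / len(text)


corpus = "The cat sat on the mat\n0x3f 0x9a ||| 77 ## 12\nA plain sentence"
# Dividing step: here the portions are simply lines.
kept, rejected = filter_corpus(corpus.split("\n"), toy_regularity, threshold=0.3)
```

In this toy run the two plain sentences are kept and the line of hexadecimal and symbol debris is rejected. A real scorer would replace `toy_regularity` with a score derived from a language model.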

2. Method according to Claim 1, wherein said character sequences probabilities are
derived from a statistical model representative of said language.

3. Method according to Claim 2, wherein for each portion of the corpus of
documents, said regularity value (V_R) is based on a computed perplexity of the portion
with respect to said statistical model.
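
Illustrative note (not part of the claims): perplexity is commonly defined per character as 2 raised to the negative mean log2-probability the model assigns to each character given its context. The sketch below computes that quantity for one portion. The function name `perplexity`, the `prob(context, ch)` callback signature, the space padding at the start of the portion, and the toy model `uniform4` are all assumptions made for illustration.

```python
import math
from typing import Callable


def perplexity(text: str, prob: Callable[[str, str], float], n: int = 3) -> float:
    """Per-character perplexity of text under a conditional model prob(context, ch)."""
    if not text:
        raise ValueError("perplexity is undefined for an empty portion")
    # Assumption: pad with spaces so the first characters also have a context.
    padded = " " * (n - 1) + text
    log_sum = 0.0
    for i, ch in enumerate(text):
        context = padded[i:i + n - 1]  # the n-1 characters preceding ch
        log_sum += math.log2(prob(context, ch))
    return 2.0 ** (-log_sum / len(text))


def uniform4(context: str, ch: str) -> float:
    """Hypothetical toy model: every character has probability 1/4."""
    return 0.25


value = perplexity("abab", uniform4)
```

Under the uniform toy model the result equals the number of equally likely symbols, which is the usual intuition for perplexity: the lower the value, the more predictable the portion is for the model.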

4. Method according to Claim 2, wherein said statistical model is previously 
elaborated from a reference document determined as conforming with the rules of said 
language. 

5. Method according to Claim 2, wherein said statistical model is determined
according to N-gram statistics.
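
Illustrative note (not part of the claims): N-gram statistics can be gathered by counting each sequence of N symbols together with its (N-1)-symbol context. The sketch below does this at the character level. The names `train_char_ngrams` and `char_probability`, the space padding, and the add-one smoothing are assumptions chosen for brevity; the claims do not prescribe a smoothing scheme.

```python
from collections import Counter
from typing import Tuple

Model = Tuple[Counter, Counter, int]


def train_char_ngrams(text: str, n: int = 3) -> Model:
    """Count character n-grams and their (n-1)-character contexts."""
    padded = " " * (n - 1) + text
    ngrams: Counter = Counter()
    contexts: Counter = Counter()
    for i in range(len(text)):
        gram = padded[i:i + n]  # n-gram ending at text[i]
        ngrams[gram] += 1
        contexts[gram[:-1]] += 1
    vocab_size = len(set(padded))
    return ngrams, contexts, vocab_size


def char_probability(model: Model, context: str, ch: str) -> float:
    """Estimate P(ch | context) with add-one smoothing (an assumed choice)."""
    ngrams, contexts, vocab_size = model
    # The extra +1 reserves probability mass for characters unseen in training.
    return (ngrams[context + ch] + 1) / (contexts[context] + vocab_size + 1)


model = train_char_ngrams("the theme then", n=3)
p_seen = char_probability(model, "th", "e")    # "the" occurs three times
p_unseen = char_probability(model, "th", "x")  # "thx" never occurs
```

A sequence that follows the reference text's habits receives a higher probability than one that does not, which is what lets such a model separate regular text from debris.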

6. Method according to Claim 2, wherein said statistical model is a character-based
N-gram model.

7. Method according to Claim 2, wherein said statistical model is initially used to
filter a first corpus segment of a predetermined size to provide a first filtered segment of
the corpus of documents, said first filtered segment serving as a basis for computing a
more accurate statistical model which is to be used to filter the rest of the corpus of
documents.
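
Illustrative note (not part of the claims): the two-pass refinement of Claim 7 can be sketched as follows. To stay short, the sketch uses a character unigram model in place of a full N-gram model, and it retrains on the reference text plus the first filtered segment; both choices, as well as the names `train`, `perplexity`, and `bootstrap_filter` and the reuse of one threshold for both passes, are assumptions.

```python
import math
from collections import Counter
from typing import List


def train(texts: List[str]) -> Counter:
    """Character unigram counts (a stand-in for a richer N-gram model)."""
    return Counter("".join(texts))


def perplexity(text: str, counts: Counter) -> float:
    """Per-character perplexity with add-one smoothing."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 for characters unseen in training
    log_sum = sum(math.log2((counts[ch] + 1) / (total + vocab)) for ch in text)
    return 2.0 ** (-log_sum / max(len(text), 1))


def bootstrap_filter(
    reference: List[str], portions: List[str], first_size: int, threshold: float
) -> List[str]:
    """Filter a first segment, retrain on what survives, then filter the rest."""
    initial = train(reference)
    first, rest = portions[:first_size], portions[first_size:]
    first_kept = [p for p in first if perplexity(p, initial) <= threshold]
    # Assumption: the refined model is trained on the reference plus the
    # first filtered segment.
    refined = train(reference + first_kept)
    rest_kept = [p for p in rest if perplexity(p, refined) <= threshold]
    return first_kept + rest_kept


reference = ["the quick brown fox jumps over the lazy dog"]
portions = ["the dog jumps", "@@##$$%%^^", "a lazy brown fox", "~~~~||||"]
result = bootstrap_filter(reference, portions, first_size=2, threshold=40.0)
```

In practice the threshold may need to be recomputed for the refined model, since retraining changes the scale of the perplexity values.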

8. Method according to Claim 1, wherein said threshold value (V_T) is determined by
executing the following steps:

- defining a test corpus as a subset of the corpus of documents to be filtered; 

- manually cleaning said test corpus so as to obtain a cleaned test corpus which is
representative of the type of textual information that is considered as being sufficiently in
conformity with the language rules and a rejected test corpus that is the complement of
said cleaned test corpus;

- computing a perplexity value for each of said cleaned and rejected test corpora
with regard to said statistical model; and

- setting the sought threshold value between the computed perplexity values.
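
Illustrative note (not part of the claims): once the two perplexity values are known, any value lying between them satisfies the last step. The sketch below picks the geometric mean, which is the midpoint on a logarithmic scale; that choice and the name `choose_threshold` are assumptions, as the claim only requires the threshold to lie between the two values.

```python
import math


def choose_threshold(pp_cleaned: float, pp_rejected: float) -> float:
    """Pick V_T between the perplexities of the cleaned and rejected test corpora."""
    if pp_cleaned <= 0 or pp_rejected <= 0:
        raise ValueError("perplexity values must be positive")
    if pp_cleaned >= pp_rejected:
        # The cleaned text should be the more predictable of the two.
        raise ValueError("cleaned corpus should have lower perplexity than rejected corpus")
    # Assumed choice: geometric mean, i.e. the midpoint in log-perplexity.
    return math.sqrt(pp_cleaned * pp_rejected)


# Hypothetical perplexity values for the two test corpora.
v_t = choose_threshold(16.0, 64.0)
```

Moving the threshold toward the cleaned value rejects more aggressively, and moving it toward the rejected value keeps more borderline portions.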

9. Method according to Claim 1, wherein said portions comprise lines, paragraphs,
and whole documents, whose size is determined as a function of the overall size of the
corpus of documents or as a function of the nature of the documents contained in the
corpus of documents or both, so as to obtain a granularity desired for the filtering.
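
Illustrative note (not part of the claims): the three granularities named in Claim 9 can be sketched as a single dividing function. The name `divide`, the string labels, and the convention that a blank line separates paragraphs are assumptions; how the granularity is selected from the corpus size or nature is left to the caller here.

```python
import re
from typing import List


def divide(document: str, granularity: str) -> List[str]:
    """Divide one document into portions at the requested granularity."""
    if granularity == "line":
        parts = document.splitlines()
    elif granularity == "paragraph":
        # Assumption: paragraphs are separated by one or more blank lines.
        parts = re.split(r"\n\s*\n", document)
    elif granularity == "document":
        parts = [document]
    else:
        raise ValueError(f"unknown granularity: {granularity!r}")
    # Drop empty portions and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]


doc = "line one\nline two\n\nsecond paragraph"
by_line = divide(doc, "line")
by_paragraph = divide(doc, "paragraph")
by_document = divide(doc, "document")
```

Finer portions remove debris more selectively but give the scorer less text per decision, so very short lines may score unreliably.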

10. An apparatus for automatically filtering a corpus of documents containing
textual and non-textual information of a natural language, the apparatus being
characterized in that it comprises:

- means for dividing the corpus of documents into appropriate portions; 

- means for determining for each portion of the corpus of documents a regularity 
value measuring the conformity of the portion with respect to character sequences 
probabilities predetermined for said language; 

- means for comparing each regularity value with a threshold value to decide
whether the conformity is sufficient; and

- means for rejecting any portion of the corpus of documents whose conformity is 
not sufficient. 

11. Apparatus according to Claim 10, wherein said character sequences 
probabilities are derived from a statistical model representative of said language. 

12. Apparatus according to Claim 11, wherein for each portion of the corpus of
documents, said regularity value (V_R) is based on a computed perplexity of the portion
with respect to said statistical model.

13. Apparatus according to Claim 11, wherein said statistical model is previously
elaborated from a reference document determined as conforming with the rules of said
language.

14. Apparatus according to Claim 11, wherein said statistical model is
determined according to N-gram statistics.

15. Apparatus according to Claim 11, wherein said statistical model is a
character-based N-gram model.

16. Apparatus according to Claim 11, wherein said statistical model is initially used
to filter a first corpus segment of a predetermined size to provide a first filtered segment
of the corpus of documents, said first filtered segment serving as a basis for computing a
more accurate statistical model which is to be used to filter the rest of the corpus of
documents.

17. Apparatus according to Claim 10, wherein said threshold value (V_T) is
determined by executing the following steps:

- defining a test corpus as a subset of the corpus of documents to be filtered;

- manually cleaning said test corpus so as to obtain a cleaned test corpus which is
representative of the type of textual information that is considered as being sufficiently in
conformity with the language rules and a rejected test corpus that is the complement of
said cleaned test corpus;

- computing a perplexity value for each of said cleaned and rejected test corpora
with regard to said statistical model; and

- setting the sought threshold value between the computed perplexity values.

18. Apparatus according to Claim 10, wherein said portions comprise lines,
paragraphs, and whole documents, whose size is determined as a function of the overall
size of the corpus of documents or as a function of the nature of the documents contained
in the corpus of documents or both, so as to obtain a granularity desired for the filtering.

19. A computer system comprising an apparatus according to Claim 10.

20. A computer program comprising software code portions for performing a
method according to Claim 1, when said computer program is loaded and executed by a
computer system.

21. A computer-readable program storage medium which stores a program for
executing a method for automatically filtering a corpus of documents containing textual
and non-textual information of a natural language, the method being characterized in that
it comprises the steps of:

- dividing the corpus of documents into appropriate portions; 

- determining for each portion of the corpus of documents a regularity value (V_R)
measuring the conformity of the portion with respect to character sequences probabilities
predetermined for said language;

- comparing each regularity value with a threshold value (V_T) to decide whether the
conformity is sufficient; and

- rejecting any portion of the corpus of documents whose conformity is not
sufficient.

22. Computer-readable program storage medium according to Claim 21, wherein
said character sequences probabilities are derived from a statistical model representative of
said language.

23. Computer-readable program storage medium according to Claim 22, wherein
for each portion of the corpus of documents, said regularity value (V_R) is based on a
computed perplexity of the portion with respect to said statistical model.

24. Computer-readable program storage medium according to Claim 22, wherein 
said statistical model is previously elaborated from a reference document determined as 
conforming with the rules of said language. 

25. Computer-readable program storage medium according to Claim 22, wherein
said statistical model is determined according to N-gram statistics.

26. Computer-readable program storage medium according to Claim 22, wherein 
said statistical model is a character-based N-gram model. 

27. Computer-readable program storage medium according to Claim 22, wherein 
said statistical model is initially used to filter a first corpus segment of a predetermined 
size to provide a first filtered segment of the corpus of documents, said first filtered 
segment serving as a basis for computing a more accurate statistical model which is to be 
used to filter the rest of the corpus of documents. 

28. Computer-readable program storage medium according to Claim 21, wherein
said threshold value (V_T) is determined by executing the following steps:

- defining a test corpus as a subset of the corpus of documents to be filtered; 

- manually cleaning said test corpus so as to obtain a cleaned test corpus which is
representative of the type of textual information that is considered as being sufficiently in
conformity with the language rules and a rejected test corpus that is the complement of
said cleaned test corpus;

- computing a perplexity value for each of said cleaned and rejected test corpora
with regard to said statistical model; and

- setting the sought threshold value between the computed perplexity values.

29. Computer-readable program storage medium according to Claim 21, wherein
said portions comprise lines, paragraphs, and whole documents, whose size is
determined as a function of the overall size of the corpus of documents or as a function of
the nature of the documents contained in the corpus of documents or both, so as to obtain
a granularity desired for the filtering.